This is a 10-hour course teaching the basics of Spark to students who are already familiar with coding and machine learning.
All examples are given in Python.
Go to the book's companion repository: https://github.com/databricks/Spark-The-Definitive-Guide
Download the whole repo. It also provides the data used in the examples.
[SDG] The book: Spark: The Definitive Guide
[2] https://luminousmen.com/post/spark-tips-partition-tuning
Either use the free Community Edition of Databricks (https://community.cloud.databricks.com/)
or run Spark locally on your PC (instructions are provided for Linux/Windows/Mac)
- understand the concepts
- practice simple operations
- get basic familiarity with configuration and tuning
- run simple machine learning models
What are horizontal scaling and vertical scaling?
Apache Spark is an open-source cluster computing framework.
Developed as a faster, more general successor to Hadoop MapReduce.
Uses in-memory computing.
Originally developed at UC Berkeley (2009).
In a real production environment you can use a Databricks-managed cluster in the cloud, or Microsoft HDInsight. You can also install your own Spark cluster, locally or in the cloud; such a cluster can grow to thousands of machines.
During this course we will use a minimal installation on your own PC/Mac/Linux machine.
Instructions are here: https://github.com/cnoam/spark-course/blob/master/readme.md
Occasionally, you will have opportunities to check your knowledge. Try to answer/solve/execute all the questions. They will help you make sure you are ready for the next part!
- Explain the difference between horizontal and vertical scaling
- Look up the definition of "cluster". Does it match what we have in Spark?